Team:ALMANACH

Inria | Raweb 2017 | Presentation of the Team ALMANACH | ALMANACH Web Site


Section: New Results

Tweet processing

Participants: Éric Villemonte de La Clergerie, Djamé Seddah, Benoît Sagot.

In the context of the SoSweet and Parsiti ANR actions, we ran several experiments on large collections of tweets.

In a first experiment, around 20 million tweets were normalized and then parsed with FRMG. A first observation was that the current level of pre-parsing normalization is not sufficient to ensure good parsing coverage with FRMG (around 67%, compared with around 93% on FTB journalistic texts); it also leads to high parsing times because of the correction strategies. Nevertheless, error mining was used to identify a first set of easy errors, and further developments are planned to track errors more closely related to segmentation and normalization. Clustering and word embeddings were also tried on lemmas, relying on the dependency parse trees, again with only partially successful results because of the poor quality of the pre-parsing phases.
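To make the notions of parsing coverage and error mining concrete, here is a minimal sketch in Python. It is not the error mining algorithm used with FRMG: it only computes a naive per-token failure rate, whereas proper error mining typically refines such scores iteratively. The input format (pairs of a token list and a parse-success flag) and the names `coverage`, `suspicion_rates` and `min_count` are assumptions made for illustration.

```python
from collections import Counter


def coverage(sentences):
    """Fraction of sentences that received a full parse.

    `sentences` is a list of (tokens, parsed_ok) pairs (hypothetical format).
    """
    sentences = list(sentences)
    return sum(1 for _, ok in sentences if ok) / len(sentences)


def suspicion_rates(sentences, min_count=2):
    """Rank tokens by the share of their sentences that failed to parse.

    Naive one-pass approximation: a token seen mostly in unparsed
    sentences is a candidate source of errors (for example a
    non-normalized form). Tokens seen in fewer than `min_count`
    sentences are ignored to limit noise.
    """
    seen = Counter()
    failed = Counter()
    for tokens, parsed_ok in sentences:
        for tok in set(tokens):  # count each token once per sentence
            seen[tok] += 1
            if not parsed_ok:
                failed[tok] += 1
    rates = {tok: failed[tok] / n for tok, n in seen.items() if n >= min_count}
    # Highest failure rate first, ties broken alphabetically.
    return sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))
```

On a toy corpus, a form that only occurs in unparsed sentences would rise to the top of the ranking, which is the kind of "easy error" that can then be handled at normalization time.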

In a second experiment, we adapted our clustering (DepCluster) and word embedding (DepGlove) algorithms to take into account non-linguistic relations, such as the author-word relation (between an author and the words of her tweets). The algorithms were applied to raw tweets with only basic tokenisation, and results were produced on a monthly basis over 18 months (2016/02 to 2017/08). Several tools, in particular Cytoscape, were tried to visualize the results as networks, in order to identify and explain communities.
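As an illustration of the author-word relation, the following Python sketch collects per-month counts of (author, word) pairs from raw tweets with a basic tokenisation. It does not reproduce DepCluster or DepGlove; it only shows the kind of relation counts such algorithms could take as input. The tweet format (author, ISO date string, text), the regular expression and the function names are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Basic tokenisation: words, optionally prefixed by '#' or '@'
# so that hashtags and mentions are kept as single tokens.
TOKEN_RE = re.compile(r"[#@]?\w+")


def tokenize(text):
    """Lowercase the text and split it into simple tokens."""
    return TOKEN_RE.findall(text.lower())


def author_word_counts(tweets):
    """Count (author, word) pairs per month.

    `tweets` is an iterable of (author, 'YYYY-MM-DD', text) triples
    (hypothetical format). Returns a dict mapping 'YYYY-MM' to a
    Counter of (author, word) pairs.
    """
    by_month = defaultdict(Counter)
    for author, date, text in tweets:
        month = date[:7]  # 'YYYY-MM'
        for word in tokenize(text):
            by_month[month][(author, word)] += 1
    return by_month
```

Each monthly counter can be read as a weighted bipartite graph between authors and words, which is one possible form in which results could be exported to a network visualization tool.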
